Hierarchical clustering

Create Data Set


In [1]:
%matplotlib inline
import pandas as pd
import numpy as np

x = [1.88505201,0.75858685,0.53086046,2.10121118,2.90456146,2.82199243,1.21688824,2.08582494,2.80032271,2.8871096,1.89067363,1.05548585]
y = [1.83256566,0.84474922,0.9779429,1.81776092,0.91189043,0.90186282,1.19189881,1.8977981,1.09191789,1.02681764,2.48316704,1.01289176]

Plot Data


In [2]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)
plt.scatter(x,y)
numberOfPoints = len(x)
labels = []
for pt in range(numberOfPoints):
    ax.annotate(pt, xy=(x[pt]+0.05, y[pt]+0.05))
    labels.append('Point ' +  str(pt))


Create Tuples

The built-in zip function pairs the x and y arrays into tuples. Note that in Python 3, zip returns a lazy iterator rather than a list.


In [3]:
zipped = zip(x, y)
print(list(zipped))


[(1.88505201, 1.83256566), (0.75858685, 0.84474922), (0.53086046, 0.9779429), (2.10121118, 1.81776092), (2.90456146, 0.91189043), (2.82199243, 0.90186282), (1.21688824, 1.19189881), (2.08582494, 1.8977981), (2.80032271, 1.09191789), (2.8871096, 1.02681764), (1.89067363, 2.48316704), (1.05548585, 1.01289176)]
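One Python 3 caveat worth knowing before reusing `zipped`: the iterator can only be consumed once. A minimal sketch with the first two points:

```python
x = [1.88505201, 0.75858685]
y = [1.83256566, 0.84474922]

zipped = zip(x, y)
pairs = list(zipped)   # materialize the iterator into a list of tuples
print(pairs)           # [(1.88505201, 1.83256566), (0.75858685, 0.84474922)]

# the iterator is now exhausted; a second pass yields nothing
print(list(zipped))    # []
```

This is why the next cell calls zip(x, y) again instead of reusing zipped.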

Place data into a data frame


In [4]:
df = pd.DataFrame(list(zip(x, y)), columns=['x', 'y'], index=labels)
df


Out[4]:
x y
Point 0 1.885052 1.832566
Point 1 0.758587 0.844749
Point 2 0.530860 0.977943
Point 3 2.101211 1.817761
Point 4 2.904561 0.911890
Point 5 2.821992 0.901863
Point 6 1.216888 1.191899
Point 7 2.085825 1.897798
Point 8 2.800323 1.091918
Point 9 2.887110 1.026818
Point 10 1.890674 2.483167
Point 11 1.055486 1.012892

Create distance matrix


In [5]:
# compute distance matrix
from scipy.spatial.distance import pdist, squareform

distMatrix = pd.DataFrame(squareform(pdist(df, metric='euclidean')))
distMatrix


Out[5]:
0 1 2 3 4 5 6 7 8 9 10 11
0 0.000000 1.498234 1.601317 0.216666 1.373697 1.320631 0.925687 0.211104 1.177404 1.285826 0.650626 1.166210
1 1.498234 0.000000 0.263818 1.658129 2.147025 2.064196 0.574937 1.694247 2.056642 2.136295 1.991490 0.341205
2 1.601317 0.263818 0.000000 1.780813 2.374620 2.292395 0.718618 1.806668 2.272322 2.356756 2.028495 0.525788
3 0.216666 1.658129 1.780813 0.000000 1.210774 1.165502 1.083388 0.081503 1.007772 1.115001 0.697919 1.319604
4 1.373697 2.147025 2.374620 1.210774 0.000000 0.083176 1.710744 1.281539 0.208028 0.116245 1.869994 1.851832
5 1.320631 2.064196 2.292395 1.165502 0.083176 0.000000 1.631098 1.238479 0.191286 0.140904 1.835178 1.769992
6 0.925687 0.574937 0.718618 1.083388 1.710744 1.631098 0.000000 1.119529 1.586588 1.678360 1.456489 0.241027
7 0.211104 1.694247 1.806668 0.081503 1.281539 1.238479 1.119529 0.000000 1.077010 1.183497 0.617042 1.358182
8 1.177404 2.056642 2.272322 1.007772 0.208028 0.191286 1.586588 1.077010 0.000000 0.108490 1.662238 1.746626
9 1.285826 2.136295 2.356756 1.115001 0.116245 0.140904 1.678360 1.183497 0.108490 0.000000 1.764607 1.831677
10 0.650626 1.991490 2.028495 0.697919 1.869994 1.835178 1.456489 0.617042 1.662238 1.764607 0.000000 1.690931
11 1.166210 0.341205 0.525788 1.319604 1.851832 1.769992 0.241027 1.358182 1.746626 1.831677 1.690931 0.000000
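As a sanity check, one entry of the matrix can be verified by hand: the Euclidean distance between Point 0 and Point 3 should match the 0.216666 shown above. A minimal sketch:

```python
import numpy as np

p0 = np.array([1.88505201, 1.83256566])
p3 = np.array([2.10121118, 1.81776092])

# Euclidean distance = sqrt(dx^2 + dy^2)
d = np.linalg.norm(p0 - p3)
print(d)  # approximately 0.216666, matching row 0, column 3 above
```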

Perform hierarchical clustering

Hierarchical clustering and dendrogram plotting can both be done with scipy. Note that linkage expects a condensed distance matrix (the flat upper-triangular output of pdist), not the square form.


In [6]:
# perform clustering and plot the dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram

# squareform converts the square matrix back to the condensed form linkage expects
R = dendrogram(linkage(squareform(distMatrix), method='average'))
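Beyond the dendrogram, scipy's fcluster can cut the tree into flat cluster labels. A minimal self-contained sketch on the same toy points, assuming we want at most three clusters:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

x = [1.88505201, 0.75858685, 0.53086046, 2.10121118, 2.90456146, 2.82199243,
     1.21688824, 2.08582494, 2.80032271, 2.8871096, 1.89067363, 1.05548585]
y = [1.83256566, 0.84474922, 0.9779429, 1.81776092, 0.91189043, 0.90186282,
     1.19189881, 1.8977981, 1.09191789, 1.02681764, 2.48316704, 1.01289176]

points = np.column_stack([x, y])

# average linkage on the condensed distance matrix, as above
Z = linkage(pdist(points, metric='euclidean'), method='average')

# cut the tree so that at most 3 flat clusters remain
clusterLabels = fcluster(Z, t=3, criterion='maxclust')
print(clusterLabels)  # one integer cluster label per point
```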


Plot unclustered data as a heatmap

Heatmaps help us visualize the data features.


In [7]:
fig = plt.figure(figsize=(8,8))

ax = fig.add_subplot(111)


cax = ax.matshow(df, interpolation='nearest', cmap='hot_r')
fig.colorbar(cax)
ticks=np.arange(-.55,12,1)
plt.yticks(ticks)
ax.set_xticklabels([''] + list(df.columns))
ax.set_yticklabels(['']+list(df.index))


plt.show()


Plot heatmap of data with respect to clustering


In [8]:
# reorder rows with respect to the clustering (df.ix is deprecated; use iloc)
df_rowclust = df.iloc[R['leaves']]

# plot
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)

cax = ax.matshow(df_rowclust, interpolation='nearest', cmap='hot_r')
ticks=np.arange(-.55,12,1)
plt.yticks(ticks)
fig.colorbar(cax)

ax.set_xticklabels([''] + list(df_rowclust.columns))
ax.set_yticklabels([''] + list(df_rowclust.index))

plt.show()


Plot dendrogram adjacent to heat map


In [9]:
# plot dendrogram
fig = plt.figure(figsize=(8,8))
axd = fig.add_axes([0.1,0.125,0.2,0.6])
row_dendr = dendrogram(linkage(squareform(distMatrix), method='average'), orientation='right')
axd.set_xticks([])
axd.set_yticks([])


# Remove axes spines from dendrogram
for i in axd.spines.values():
    i.set_visible(False)


# Plot heatmap
axm = fig.add_axes([0.005,0.1,0.6,0.6]) # x-pos, y-pos, width, height
cax = axm.matshow(df_rowclust, interpolation='nearest', cmap='hot_r', origin='lower')
fig.colorbar(cax)
ticks=np.arange(-.55,12,1)
plt.yticks(ticks)
axm.set_xticklabels([''] + list(df_rowclust.columns))
axm.set_yticklabels([''] + list(df_rowclust.index))

plt.show()


Seaborn

Seaborn is a statistical data visualization library that automates a lot of what we did previously and does so using nice visualizations.

http://stanford.edu/~mwaskom/software/seaborn/


In [10]:
import seaborn as sns
sns.clustermap(df, method='average')


Out[10]:
<seaborn.matrix.ClusterGrid at 0x111fb3610>
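clustermap can also standardize the data on the fly via its z_score argument (0 z-scores each row, 1 z-scores each column), which foreshadows the normalization discussed below. A minimal self-contained sketch on a few of the toy points, assuming seaborn is installed:

```python
import pandas as pd
import seaborn as sns

x = [1.88505201, 0.75858685, 0.53086046, 2.10121118]
y = [1.83256566, 0.84474922, 0.9779429, 1.81776092]
df = pd.DataFrame({'x': x, 'y': y})

# z_score=1 z-scores each column before clustering; method and metric
# mirror the scipy calls used earlier
g = sns.clustermap(df, method='average', metric='euclidean', z_score=1)

# g.data2d holds the standardized, reordered values actually plotted
print(g.data2d.mean())  # each column mean is approximately 0
```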

NBA Example


In [11]:
import pandas as pd

data = pd.read_csv('BasketBallStats.csv', index_col=0)
data.index = data.index.map(lambda x: x.strip())

# label source:https://en.wikipedia.org/wiki/Basketball_statistics
labels = ['Games', 'Minutes', 'Points', 'Field goals made',
          'Field goal attempts', 'Field goal percentage', 'Free throws made',
          'Free throws attempts', 'Free throws percentage',
          'Three-pointers made', 'Three-point attempt',
          'Three-point percentage', 'Offensive rebounds', 'Defensive rebounds',
          'Total rebounds', 'Assists', 'Steals', 'Blocks', 'Turnover',
          'Personal foul']
data.columns = labels

Data Normalization

It is important to standardize data when features are on very different scales (e.g., the values of "Games" and "Minutes" are much larger than those of the other columns), because large-valued features would otherwise dominate the distance computation.

One way to standardize data is to subtract the mean and divide by the standard deviation. This is called a z-score.


In [12]:
data_normalized = (data - data.mean()) / data.std()
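To confirm the normalization worked, each column's mean should be approximately 0 and its standard deviation 1. A minimal sketch on a small made-up frame (a hypothetical stand-in for the NBA data, which is not reproduced here):

```python
import pandas as pd

# hypothetical stand-in for the NBA stats frame
data = pd.DataFrame({'Games': [82, 70, 65, 78],
                     'Minutes': [3000, 2400, 2100, 2800]})

# z-score: subtract the column mean, divide by the column standard deviation
data_normalized = (data - data.mean()) / data.std()

print(data_normalized.mean())  # approximately 0 for every column
print(data_normalized.std())   # 1 for every column
```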

Exercise 1

Use the seaborn clustermap function to perform hierarchical clustering and visualization of:

  1. The un-normalized NBA data (data)
  2. The normalized NBA data (data_normalized)

Who does LeBron James most look like, in terms of NBA stats?


In [13]:
# Your code goes here

Exercise 2

Experiment with different linkage methods (e.g., 'complete', 'single', 'centroid').


In [ ]:
# Your code goes here